DF-23489 multinode: tighten polling criteria for liveness #101
Open
vlfig wants to merge 2 commits into rm-racy-assertions from
Successful polls now decrement the failure count instead of fully resetting it, so that nodes (RPCs) are eventually declared unreachable if they sustain poll error rates above 1:1.
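To illustrate the decaying counter, here is a minimal sketch in Go. It is not the code from this PR: `failureCounter`, `observe`, `threshold` and `errPoll` are hypothetical names chosen for the example, and the real node lifecycle has more moving parts.

```go
package main

import "errors"

// errPoll is a stand-in for whatever error a failed poll returns (hypothetical).
var errPoll = errors.New("poll failed")

// failureCounter is a hypothetical helper that tracks poll failures
// for a single node while it is in the alive state.
type failureCounter struct {
	failures  uint32
	threshold uint32 // 0 disables the check in this sketch
}

// observe records one poll result and reports whether the node
// should now be declared unreachable.
func (c *failureCounter) observe(pollErr error) bool {
	if pollErr != nil {
		c.failures++
	} else if c.failures > 0 {
		// A success only decays the count by one; it no longer resets it to zero.
		c.failures--
	}
	return c.threshold > 0 && c.failures >= c.threshold
}
```

With a threshold of 3 and a repeating pattern of two failures followed by one success, the count goes 1, 2, 1, 2, 3, so the node is declared unreachable on the fifth poll. Under a reset-on-success rule the same pattern would never reach 3. A strict 1:1 alternation of failure and success keeps the count oscillating between 1 and 0 and does not trip.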
Description
Having ship-shape RPCs is crucial for keeping the odds of missing a transmit as low as possible, which is itself crucial for SVR. It is our suspicion (and telemetry concurs) that our nodes are too lenient on RPCs with respect to their polling failures and that there are gains to be had in booting those out of the alive pool. In particular, our nodes do not detect grey failures, where RPCs fail polling intermittently but below the rate that triggers detection.
This changes the behaviour in both directions between the unreachable and alive states:
Alive, mirroring the other direction and ensuring a node doesn't make it back into alive in the same condition that got it kicked out in the first place. However, instead of following the same "decaying" counter we opted for a stricter flow where a single poll failure resets the node back to square one (dialing). This new phase/step is governed by a new config property `PollSuccessThreshold` that, defaulting to `0`, matches current behavior and allows for progressive, explicit rollout.

Implements DF-23489.
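The stricter flow for getting back into alive could be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the state names, `rejoinGate`, `onDialed` and `onPoll` are hypothetical; only `PollSuccessThreshold` and its default of `0` come from the description above.

```go
package main

// nodeState is a simplified, hypothetical view of the node lifecycle.
type nodeState int

const (
	stateDialing   nodeState = iota // (re)connecting to the RPC
	stateVerifying                  // hypothetical name for the new phase/step
	stateAlive
)

// rejoinGate decides when a freshly dialed node may be declared alive.
type rejoinGate struct {
	pollSuccessThreshold uint32 // mirrors PollSuccessThreshold; 0 skips the new phase
	successes            uint32 // consecutive successful polls since the last dial
}

// onDialed is called after a successful dial and returns the next state.
func (g *rejoinGate) onDialed() nodeState {
	g.successes = 0
	if g.pollSuccessThreshold == 0 {
		// Default: go straight to alive, as before.
		return stateAlive
	}
	return stateVerifying
}

// onPoll is called for each poll result during the verifying phase.
func (g *rejoinGate) onPoll(ok bool) nodeState {
	if !ok {
		// A single failure sends the node back to square one.
		g.successes = 0
		return stateDialing
	}
	g.successes++
	if g.successes >= g.pollSuccessThreshold {
		return stateAlive
	}
	return stateVerifying
}
```

Unlike the decaying counter used in the other direction, there is no partial credit here: the success count is discarded on the first failed poll, so a node has to produce an unbroken run of successful polls before it rejoins the alive pool.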
Requires Dependencies
None.
Resolves Dependencies
None.